Scalable Partitioning and Exploration of Chemical Spaces Using Geometric Hashing

Authors

  • Debojyoti Dutta
  • Rajarshi Guha
  • Peter C. Jurs
  • Ting Chen
Abstract

Virtual screening (VS) has become a preferred tool to augment high-throughput screening(1) and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm(2) called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249 071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
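The hashing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash family or parameters the authors used, so everything here is an assumption: the sketch uses the common random-projection scheme for Euclidean distance, and the class name `EuclideanLSH`, its parameters (`num_tables`, `num_projections`, `bucket_width`), and the `is_sparse` heuristic are hypothetical names chosen for illustration, not the paper's implementation.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index (illustrative only): several hash tables, each keyed by
    a tuple of quantized random projections of the descriptor vector."""

    def __init__(self, dim, num_tables=8, num_projections=4,
                 bucket_width=2.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        self.points = []
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a random Gaussian direction plus an offset.
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
                      rng.uniform(0.0, bucket_width))
                     for _ in range(num_projections)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, x):
        # Nearby points tend to fall into the same quantized bucket.
        return tuple(
            math.floor((sum(a * v for a, v in zip(direction, x)) + b) / self.w)
            for direction, b in funcs)

    def add(self, x):
        idx = len(self.points)
        self.points.append(x)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, x)].append(idx)
        return idx

    def candidates(self, q):
        # Union of the buckets the query falls into, one bucket per table.
        found = set()
        for funcs, buckets in self.tables:
            found.update(buckets.get(self._key(funcs, q), ()))
        return found

    def query(self, q, k=1):
        # Exact distances are computed only for the candidate set,
        # so the result is an approximate k-nearest-neighbor list.
        ranked = sorted(self.candidates(q),
                        key=lambda i: math.dist(self.points[i], q))
        return ranked[:k]

    def is_sparse(self, q, radius, min_neighbors=1):
        # Hypothetical outlier heuristic: too few colliding points nearby.
        close = [i for i in self.candidates(q)
                 if math.dist(self.points[i], q) <= radius]
        return len(close) < min_neighbors


# Synthetic 5-dimensional "descriptors" stand in for real chemical data.
demo_rng = random.Random(1)
descriptors = [[demo_rng.gauss(0.0, 1.0) for _ in range(5)]
               for _ in range(200)]
index = EuclideanLSH(dim=5)
for d in descriptors:
    index.add(d)

nearest = index.query(descriptors[0], k=3)
far_point = [1000.0] * 5
print(nearest, index.is_sparse(far_point, radius=1.0))
```

Because each table is an independent partition of the data, the buckets could in principle be processed in parallel, which is the property the abstract points to for nearest-neighbor classification or regression. Accuracy and speed depend on the assumed parameters: more tables raise recall, while more projections per table shrink the candidate set.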

Similar resources

Scalable Data Parallel Object Recognition Using Geometric Hashing on CM-5

In this paper, we present scalable parallel algorithms for object recognition using geometric hashing. We define an abstract model of CM-5. We develop a load-balancing technique that results in scalable processor-time optimal algorithms for performing a probe on the CM-5 model. Given a model of CM-5 with P PNs and a set S of feature points in a scene, a probe of the recognition phase can be perf...

Scalable Data Parallel Implementations of Object Recognition Using Geometric Hashing

Object recognition involves identifying known objects in a given scene. It plays a key role in image understanding. Geometric hashing has been proposed as a technique for model-based object recognition in occluded scenes. However, parallel techniques are needed to realize real time vision systems employing geometric hashing. In this paper, we present scalable parallel algorithms for object reco...

Fast Bayesian Shape Matching Using Geometric Algorithms

We present a Bayesian approach to comparison of geometric shapes with applications to classification of the molecular structures of proteins. Our approach involves the use of distributions defined on transformation invariant shape spaces and the specification of prior distributions on bipartite matchings. Here we emphasize the computational aspects of posterior inference arising from such model...

Scalable Packet Classification through Maximum Entropy Hashing

In this paper we propose a new packet classification algorithm, which can substantially improve the performance of a classifier by decreasing the rulebase lookup latency. The algorithm hierarchically partitions the rulebase into smaller independent sub-rulebases by employing hashing. By using the same hash key used in the partitioning a classifier only needs to look up the relevant sub-rulebase...

Compressed Image Hashing using Minimum Magnitude CSLBP

Image hashing tolerates compression, enhancement, and other signal processing operations on digital images, which are usually acceptable manipulations. Cryptographic hash functions, by contrast, are very sensitive to even single-bit changes in an image. An image hash is a summary of important quality features in quantized form. In this paper, we propose a novel image hashing algorithm for authentication which i...

Journal title:
  • Journal of Chemical Information and Modeling

Volume 46, Issue 1

Pages  -

Publication date: 2006